This first section details an example of the code necessary for cleaning a 5W dataset in R. 5W data is typically messy and in wide format. Additionally, data wrangling is time consuming and an inefficient use of a specialist’s time. The code presented below intends to demonstrate that this monthly (or more frequent) occurence can be greatly automated, allowing for more timely and accurate analysis as well as creating the space for more tailored reporting.
# specifying column types
act_col_types <- c("date", "date", "text", "text","text","text",
"text","text","text","text","text","text","text",
"numeric", "numeric", "text","text","text",
"text","text","text", "numeric", "text",
"date", "date", "text", "numeric", "numeric",
"logical", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric")
# you're gonna have to live with the error messages,
# since you can't slice before read_excel() argument
act1<- read_excel("database activities 5W.xlsx",
sheet = "data",
skip = 1,
col_types = act_col_types) %>%
clean_names() %>%
remove_empty() %>%
slice(-c(1)) %>% # removes the second row
slice(-c(12055)) # removes the grand total
Instead of showing a long vector of column names, here is a
glimpse()of the renamed dataset:
## Rows: 12,054
## Columns: 43
## $ fecha_reportado_cluster <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ mes_reportado <dttm> 2020-01-31, 2020-01-31, 2020-01-31,~
## $ codigo_de_proyecto <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ nombre_del_proyecto <chr> "Todos y Todas a la Escuela", "Todos~
## $ organizacion_lider <chr> "UNICEF - Fondo de las Naciones Unid~
## $ organizacion_implementadora <chr> "SOCIO 1", "SOCIO 1", "SOCIO 1", "SO~
## $ estado <chr> "Zulia", "Zulia", "Zulia", "Distrito~
## $ pcode1 <chr> "VE23", "VE23", "VE23", "VE01", "VE0~
## $ municipio <chr> "Maracaibo", "Maracaibo", "Maracaibo~
## $ pcode2 <chr> "VE2313", "VE2313", "VE2313", "VE010~
## $ parroquia <chr> "Luis Hurtado Higuera", "Luis Hurtad~
## $ pcode3 <chr> "VE231312", "VE231312", "VE231312", ~
## $ ubicacion <chr> "E. T. Dr. Luis Razetti Fe Y Alegrí~
## $ latitud <dbl> 10.596934, 10.596934, 10.596934, 10.~
## $ longitud <dbl> -71.66896, -71.66896, -71.66896, -66~
## $ actividad_full <chr> "CLEDU/CA2.09: Formación docente y o~
## $ actividad_codigo <chr> "CA2.09", "CA2.09", "CA2.09", "CA2.0~
## $ actividad_desc <chr> "Formación docente y otro personal e~
## $ actividad_codigo_desc <chr> "CA2.09: Formación docente y otro pe~
## $ act_covid <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ unidad <chr> "#Personas", "#Personas", "#Personas~
## $ cantidad_del_mes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ recurrente_beneficiarios <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ fecha_de_inicio <dttm> 2019-10-22, 2019-10-15, 2020-01-13,~
## $ fecha_prevista_finalizacion <dttm> 2020-01-20, 2020-01-13, 2020-01-13,~
## $ estatus_actividad <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ total_beneficiarios <dbl> 16, 13, 41, 2, 1, 25, 19, 19, 17, 17~
## $ check_beneficiaries <dbl> 16, 13, 41, 2, 1, 25, 19, 19, 17, 17~
## $ t_f_ben <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ percent_poblacion_indigena <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ percent_personas_con_discapacidad <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ m_0_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_3_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_7_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_12_17 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_18_19 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_0_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_3_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_7_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_12_17 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_18_19 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_mayores_de_19 <dbl> 11, 4, 41, 2, 1, 20, 10, 10, 5, 8, 3~
## $ m_mayores_de_19 <dbl> 5, 9, 0, 0, 0, 5, 9, 9, 12, 9, 13, 1~
# function to remove accents
rm_accent <- function(colns){
colns <- stri_trans_general(colns, "Latin-ASCII")
}
# removing accents
act1 <- act1 %>%
mutate(estado = rm_accent(str_to_upper(estado)), # string to upper and rm_accent
municipio = rm_accent(str_to_upper(municipio)),
parroquia = rm_accent(str_to_upper(parroquia)),
ubicacion = rm_accent(str_to_upper(ubicacion)),
actividad_desc = rm_accent(str_to_upper(actividad_desc)))
And mutating a new date column just with the month
# recoding
act1 <- act1 %>%
mutate(recurrente_beneficiarios =
recode(recurrente_beneficiarios,
"no" = FALSE, "No" = FALSE, "Si" = TRUE, "si" = TRUE, "sí" = TRUE, "Sí" = TRUE),
act_covid = recode(act_covid,
"no" = FALSE, "No" = FALSE, "Si" = TRUE, "si" = TRUE, "sí" = TRUE, "Sí" = TRUE)) %>%
replace_na(list(recurrente_beneficiarios = FALSE, act_covid = FALSE))
# recoding estatus_actividad
act1$estatus_actividad <- act1$estatus_actividad %>%
str_replace_all(c("En ejecucion" = "ejecucion",
"en ejecución" = "ejecucion",
"en Ejecución" = "ejecucion",
"En ejecución" = "ejecucion",
"En Ejecución" = "ejecucion",
"finalizada" = "finalizada",
"Finalizada" = "finalizada"))
# create a new variable of last day of month only
act1$mes_solo <- ceiling_date(act1$mes_reportado, "month") - days(1)
locations.csvAnd check if you need to update it
locations <- read_csv("locations.csv") %>%
mutate(estado = rm_accent(str_to_upper(estado)),
municipio = rm_accent(str_to_upper(municipio)),
parroquia = rm_accent(str_to_upper(parroquia)),
ubicacion = rm_accent(str_to_upper(ubicacion)))
# see if you need to update locations
locations_add <- act1 %>%
select(estado, pcode1, municipio, pcode2, parroquia, pcode3, ubicacion) %>%
distinct() %>%
anti_join(locations, by = "ubicacion")
This section reads in the locations reference dataset then uses it to clean
adm_dirtyand rewrites it toadm_dirty1. Thedistinct()call at the end is to ensure that no duplicates are included as theleft_join()argument will product duplicates. After which,adm_cleanandadm_dirty1are combined intoact2.
As a note, there is a to include a set of new pcodes – “todo municipio” and “todo estado”.
# splitting into two datasets, one clean one dirty.
adm_dirty <- act1 %>%
filter(is.na(estado) | is.na(pcode1) |
is.na(municipio) | is.na(pcode2) |
is.na(parroquia) | is.na(pcode3))
adm_clean <- act1 %>%
filter(!is.na(estado) & !is.na(pcode1) &
!is.na(municipio) & !is.na(pcode2) &
!is.na(parroquia) & !is.na(pcode3))
coalesce()# filling in missing values
adm_dirty <- adm_dirty %>%
left_join(locations, by = "ubicacion") %>%
mutate(estado = coalesce(estado.x, estado.y),
pcode1 = coalesce(pcode1.x, pcode1.y),
municipio = coalesce(municipio.x, municipio.y),
pcode2 = coalesce(pcode2.x, pcode2.y),
parroquia = coalesce(parroquia.x, parroquia.y),
pcode3 = coalesce(pcode3.x, pcode3.y)) %>%
select(-estado.x, -estado.y, -pcode1.x, -pcode1.y,
-municipio.x, -municipio.y, -pcode2.x, -pcode2.y,
-parroquia.x, -parroquia.y, -pcode3.x, -pcode3.y) %>%
distinct() # removing duplicate rows from the join
# bind_rows() does not care about column sequence
# but put adm_clean first so that the original order is preserved
act2 <- bind_rows(adm_clean, adm_dirty)
printing total beneficiaries for
act1andact2, followed bynrow()for both.
## [1] 28965878
## [1] 28965878
## [1] 12054
## [1] 12054
This is to check that
total_beneficiariosis equal to the sum of all disaggregated columns. First, we mutate a new column by summing all the disaggregated beneficiaries and calling ittotal_ben_check. The next function prints the differences betweentotal_ben_checkandtotal_beneficiarios. Finally, we mutate a new column in the dataset calledno_esp_benor beneficiarios no especificados to collect the differences and ensure that we do not lose them.
# mutating bencheck and unspecified beneficiaries columns
act2 <- act2 %>%
mutate(total_ben_check = select(., m_0_3:m_mayores_de_19) %>% rowSums(na.rm = TRUE),
no_esp_ben = round(total_beneficiarios) - round(total_ben_check))
Next, we print rows where the sum of disaggregated beneficiaries do not match the totals reported in the
total_beneficiarioscolumn.
| organizacion_implementadora | actividad_desc | ubicacion | total_ben_check | total_beneficiarios | no_esp_ben |
|---|---|---|---|---|---|
| SOCIO 2 | PROMOCION MENSAJES CLAVES PARA LA COMUNIDAD ESCOLAR | U.E.C. DIVINA PASTORA (CATEDRAL) | 0 | 244 | 244 |
| SOCIO 4 | INICIATIVAS PARA REINSERCION EDUCATIVA DE NNA FUERA DE LA ESCUELA | U.E. SAN JOSE DE COTIZA | 18 | 26 | 8 |
| SOCIO 6 | FORMACION DOCENTE Y OTRO PERSONAL EDUCATIVO | MINISTERIO DEL PODER POPULAR PARA LA EDUCACION | 0 | 398 | 398 |
| SOCIO 6 | FORMACION DOCENTE Y OTRO PERSONAL EDUCATIVO | CLUSTER DE EDUCACION | 0 | 146 | 146 |
# timestamp for all the files to be written below
now <- Sys.time()
For submissions to OCHA, this one aligns with their format
# turn on eval if you want the file
filename_wide <- paste0(format(now, "%y%m%d_%H%M_"), "5W_wide.csv")
write_csv(act2, filename_wide)
This section pivots the dataset longer into tidy data which, with one observation per instance of age and sex; this is necessary for easy plotting and additional operations. Additionally, rows with no beneficiaries are filtered out.
The dataset
u_benis created by taking the highest number of beneficiaries by age and sex group by location. The columns are then filtered down to create a more usable dataset. We will use this dataset for any operations related to unique beneficiaries (individuals), including comparisons with census data to determine reach and coverage. Please note that not all activities will be represented in this dataset as only the max of beneficiaries per disaggregation group per location was selected.
# pivoting longer and creating new dataframe
u_ben <- act2 %>%
select(-t_f_ben, -total_ben_check) %>% # remember to drop them
relocate(mes_solo) %>% # moves mes_solo to the first variable
pivot_longer(m_0_3:no_esp_ben,
names_to = "desagregacion", values_to = "beneficiarios") %>%
filter(recurrente_beneficiarios == FALSE) %>%
filter(beneficiarios != 0) %>% # empty cells are 0 in the 5W table
group_by(ubicacion, desagregacion) %>%
slice(which.max(beneficiarios)) %>%
ungroup() %>%
select(mes_solo, estado, pcode1, municipio, pcode2,parroquia, pcode3,
ubicacion, latitud, longitud,
desagregacion, beneficiarios)
# writing csv of u_ben with datestamp
# turn on eval if you want the file
filename_u_ben <- paste0(format(now, "%y%m%d_%H%M_"), "u_ben.csv")
write_csv(u_ben, filename_u_ben)
The dataset
act_benis for activity totals. Only the rows marked asrecurrente_beneficiarios == FALSEhave been selected. The first instance of recurring beneficiaires is marked FALSE, with all subsequent entries being marked TRUE. Similar tou_ben, it has also been pivoted longer into tidy data.
Please not that we can only assure that there no duplications within locations as specific locations might have been targetted by more than one activity. For geographic operations, please use
u_ben.
# This is the dataset for beneficiaries
act_ben <- act2 %>%
select(-t_f_ben, -total_ben_check) %>% # remember to drop them
relocate(mes_solo) %>% # moves mes_solo to the first variable
filter(recurrente_beneficiarios == FALSE) %>%
pivot_longer(m_0_3:no_esp_ben,
names_to = "desagregacion", values_to = "beneficiarios") %>%
filter(beneficiarios != 0) %>% # empty cells are 0 in the 5W table
group_by(ubicacion, desagregacion) %>%
select(mes_solo, nombre_del_proyecto, organizacion_implementadora,
estado, pcode1, municipio, pcode2, parroquia, pcode3,
ubicacion, latitud, longitud,
actividad_codigo, actividad_desc, act_covid,
desagregacion, beneficiarios)
# writing csv act_ben with datestamp
# turn on eval if you want the file
filename_act_ben <- paste0(format(now, "%y%m%d_%H%M_"), "act_ben.csv")
write_csv(act_ben, filename_act_ben)
u_ben, act_ben and difference## [1] 3660609
## [1] 5673040
## [1] 2012431
This is an entirely automated report – all charts and tables, as well as all figures within the report have been generated from the data, with no manual input. This report makes use of the outputs of the
5W_cleaningsection above. This set of 5W data pertains to the Education Cluster in Venezuela and has had partner information removed. Code will not be shown in this section.
| actividad | total | percent_of_total | male | female | sex_ratio |
|---|---|---|---|---|---|
| DISTRIBUCION DE KITS DE MATERIALES ESCOLARES | 471,568 | 50.91 | 232,197 | 239,372 | 0.97 |
| ALIMENTACION ESCOLAR | 156,472 | 16.89 | 72,035 | 84,437 | 0.85 |
| EDUCACION A DISTANCIA | 132,258 | 14.28 | 64,240 | 68,018 | 0.94 |
| APOYO PSICOEDUCATIVO PARA NNA | 120,887 | 13.05 | 54,859 | 66,028 | 0.83 |
| FORMACION DOCENTE Y OTRO PERSONAL EDUCATIVO | 17,238 | 1.86 | 3,470 | 13,224 | 0.26 |
| ACTIVIDADES CON ADOLESCENTES Y JOVENES DE NIVELACION, HABILIDADES PARA LA VIDA Y CAPACITACION TECNICA | 14,016 | 1.51 | 6,336 | 7,680 | 0.82 |
| BECAS Y OTROS INCENTIVOS PARA DOCENTES Y PERSONAL | 5,744 | 0.62 | 1,572 | 4,172 | 0.38 |
| ACTIVIDADES RECREATIVAS | 4,317 | 0.47 | 1,998 | 2,319 | 0.86 |
| INICIATIVAS PARA REINSERCION EDUCATIVA DE NNA FUERA DE LA ESCUELA | 3,869 | 0.42 | 2,114 | 1,755 | 1.2 |
A total of 728,408 individuals have been reached to date. In terms of frequencies (inclusive of double counting), 926,369 have been reached.
Additionally, the 4,746,671 beneficiary frequencies reached by the activity PROMOCION MENSAJES CLAVES PARA LA COMUNIDAD ESCOLAR have been removed from the totals in this report as the activity consists of solely radio messaging.
figures are unique beneficiaries/individuals
With reference to the 2017 Matricula dataset, we can see that the Education programme has reached an overall average of 9% of schoolgoing children aged 3-17 nationwide. Children aged 3-17 consitute 88% of all UNICEF beneficiaries.
| Edad grupo | beneficiarios | matricula2017 | percent_total |
|---|---|---|---|
| 3-6 | 121,169 | 1,438,475 | 8.423 |
| 7-12 | 352,808 | 3,252,505 | 10.85 |
| 12-17 | 169,771 | 2,205,724 | 7.697 |
The number of individuals reached has increased by 175,005 in the past month, reaching a total of 728,408. The number of beneficiary frequencies reached has increased by 218,443 in the same period, reaching a total of 926,369.
mouse over to see details
Progress in recent months has largely been due to the distribution of education kits and distance learning.
A total of 2,229 schools have been reached by UNICEF; 44% are from Miranda and Zulia alone.
logarithmic scale; larger points indicate more beneficiaries reached, darker blues indicate more activity types
mouse over municipalities to see beneficiaries and distinct activities
A total of 110 municipalities were reached by the UNICEF Education programme.
| estado | municipio | beneficiarios |
|---|---|---|
| DISTRITO CAPITAL | LIBERTADOR | 80482 |
| MIRANDA | SUCRE | 59176 |
| ZULIA | MARACAIBO | 55370 |
| BOLIVAR | CARONI | 37908 |
| ZULIA | SAN FRANCISCO | 29369 |
| AMAZONAS | ATURES | 22430 |
| LARA | IRIBARREN | 21977 |
| BOLIVAR | HERES | 18293 |
| DELTA AMACURO | TUCUPITA | 17953 |
| BOLIVAR | CEDENO | 15681 |
| estado | municipio | coverage_percent |
|---|---|---|
| TACHIRA | FERNANDEZ FEO | 87 |
| TACHIRA | AYACUCHO | 79 |
| ZULIA | MACHIQUES DE PERIJA | 79 |
| AMAZONAS | AUTONOMO AUTANA | 76 |
| TACHIRA | SAMUEL DARIO MALDONADO | 75 |
| TACHIRA | PANAMERICANO | 72 |
| MIRANDA | PLAZA | 68 |
| TACHIRA | INDEPENDENCIA | 68 |
| TACHIRA | JUNIN | 65 |
| MIRANDA | EL HATILLO | 64 |
Together, the 10 municipalities with the highest reach (above left) form 49% of the 728,408 beneficiaries reached. The average coverage of the school-age population in the municipalities where UNICEF is present is 20%. Coverage refers to the percentage of enrolled children (aged 3-17 years) reached by UNICEF.
Below is a histogram of munciipalities where UNICEF is present showing the coverage of enrolled children (aged 3-17). Of note, we have reached 10% or less of the population in 55 out of the 110 in which we operate. This is in addition to the 226 where no UNICEF Education activities have occurred.
| partner | SOCIO 2 | SOCIO 1 | SOCIO 4 | SOCIO 5 | UNICEF | SOCIO 3 | SOCIO 6 | SOCIO 7 | SOCIO 10 | SOCIO 8 |
| act_types | 9 | 8 | 8 | 6 | 6 | 4 | 2 | 2 | 1 | 1 |
mouse over for details
| organizacion_implementadora | beneficiarios | percent_of_total | male | female | sex_ratio | municipalities |
|---|---|---|---|---|---|---|
| UNICEF | 353,944 | 38.21 | 168,281 | 185,664 | 0.91 | 74 |
| SOCIO 1 | 160,260 | 17.3 | 79,958 | 80,302 | 1 | 92 |
| SOCIO 2 | 145,997 | 15.76 | 65,390 | 80,607 | 0.81 | 51 |
| SOCIO 5 | 113,845 | 12.29 | 50,511 | 63,334 | 0.8 | 10 |
| SOCIO 4 | 86,074 | 9.29 | 42,534 | 43,540 | 0.98 | 17 |
| SOCIO 7 | 31,322 | 3.38 | 14,750 | 16,572 | 0.89 | 10 |
| SOCIO 8 | 29,471 | 3.18 | 15,379 | 14,092 | 1.09 | 17 |
| SOCIO 10 | 2,461 | 0.27 | 1,151 | 1,310 | 0.88 | 1 |
| SOCIO 3 | 2,423 | 0.26 | 862 | 1,561 | 0.55 | 3 |
| SOCIO 6 | 572 | 0.06 | 5 | 23 | 0.22 | 7 |
use UNICEF_present to filter to municipalities where the Education programme operates
CA01.05 Promocion de mensajes claves para la comunidad escolar is not included